Clustering Deep Web Databases Semantically
Abstract
Deep Web database clustering is a key operation in organizing Deep Web resources. Traditional approaches compute similarity as cosine similarity in the Vector Space Model (VSM). However, cosine similarity cannot capture the semantic similarity between the contents of two databases. This paper discusses how to cluster Deep Web databases semantically. First, a fuzzy semantic measure is proposed, which integrates ontology and fuzzy set theory to compute the semantic similarity between the visible features of two Deep Web forms. Then a hybrid Particle Swarm Optimization (PSO) algorithm is provided for clustering Deep Web databases. Finally, the clustering results are evaluated according to the Average Similarity of Document to the Cluster Centroid (ASDC) and the Rand Index (RI). Experiments show that: 1) the hybrid PSO approach yields higher ASDC values than the PSO-based and K-Means-based approaches, which means that it achieves higher intra-cluster similarity and lower inter-cluster similarity; 2) the clustering results based on fuzzy semantic similarity have higher ASDC and RI values than those based on cosine similarity, which supports the conclusion that the fuzzy semantic similarity approach can uncover latent semantics.
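The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy vocabulary, and in particular the ASDC formula (the mean over clusters of the average cosine similarity between each member and its cluster centroid) are assumptions made for illustration, since the abstract does not give the exact definitions. The Rand Index follows its standard pair-counting definition.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors (VSM)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def asdc(vectors, labels):
    """Assumed ASDC: mean over clusters of the average member-to-centroid
    cosine similarity. The paper's exact formula may differ."""
    clusters = {}
    for vec, lab in zip(vectors, labels):
        clusters.setdefault(lab, []).append(vec)
    per_cluster = []
    for members in clusters.values():
        c = centroid(members)
        sims = [cosine_similarity(m, c) for m in members]
        per_cluster.append(sum(sims) / len(sims))
    return sum(per_cluster) / len(per_cluster)


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which two partitions agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical vocabulary: ["author", "writer", "title", "price"].
# Two book-search forms that use synonymous labels share no terms,
# so their cosine similarity is 0 even though they mean the same thing.
form_a = [1, 0, 0, 0]  # labelled "author"
form_b = [0, 1, 0, 0]  # labelled "writer"
synonym_similarity = cosine_similarity(form_a, form_b)
```

The last lines show the limitation the abstract points to: a purely term-based measure treats "author" and "writer" as unrelated, which is the gap an ontology-backed semantic measure is meant to close.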
Similar resources
Model-based Clustering on Semantically Heterogeneous Distributed Databases on the Internet
The vision of the Semantic Web brings challenges to knowledge discovery on databases in such a heterogeneous, distributed, open environment. The databases are developed independently with semantic information embedded, and they are heterogeneous with respect to data granularity, ontology/schema information, etc. Distributed knowledge discovery (DKD) methods are required to take semantic info...
Classification of Deep Web Databases Based on the Context of Web Pages
New techniques are discussed for enhancing the classification precision of deep Web databases. They include using the content text of the HTML pages that contain the database entry forms as context, and a unification process for the database attribute labels. An algorithm for locating the content text in HTML pages is developed based on multiple statistical characteristics of the text b...
Sampling the National Deep Web
A huge portion of today’s Web consists of web pages filled with information from myriads of online databases. This part of the Web, known as the deep Web, is to date relatively unexplored and even major characteristics such as number of searchable databases on the Web or databases’ subject distribution are somewhat disputable. In this paper, we revisit a problem of deep Web characterization: ho...
Focused Crawling of the Deep Web Using Service Class Descriptions
Dynamic Web data sources—sometimes known collectively as the Deep Web—increase the utility of the Web by providing intuitive access to data repositories anywhere that Web access is available. Deep Web services provide access to real-time information, like entertainment event listings, or present a Web interface to large databases or other data repositories. Recent studies suggest that the size ...
Stratification Based Hierarchical Clustering Over a Deep Web Data Source
This paper focuses on the problem of clustering data from a hidden or deep web data source. A key characteristic of deep web data sources is that data can only be accessed through the limited query interface they support. Because the underlying data set cannot be accessed directly, data mining must be performed on samples of the data set. The samples, in turn, can only be obtained by ...